Genome Biology — Latest Matching Preprints

1

dreampy: Pseudobulk mixed-model differential expression for single-cell RNA-seq in Python

Wells, S. B.; Shahnawaz, H.; Jones, J. L.

2026-03-24 bioinformatics 10.64898/2026.03.21.713408 medRxiv

Top 0.1%

22.3%

dreampy is a Python implementation of the R dreamlet framework for pseudobulk differential expression analysis of single-cell RNA-seq data. dreamlet combines voom precision-weighted linear mixed models with empirical Bayes moderation to handle batch effects, repeated measures, and other hierarchical structure in multi-donor studies, but exists entirely within the R/Bioconductor ecosystem. dreampy reproduces this pipeline natively in Python, integrating with AnnData and the scverse ecosystem.

2

MiCBuS: Marker Gene Mining for Unknown Cell Types Using Bulk and Single Cell RNA-Seq Data

Zhang, S.; Lu, Y.; Luo, Q.; An, L.

2026-03-24 bioinformatics 10.64898/2026.03.20.711946 medRxiv

Top 0.1%

19.3%

Identifying cell type-specific expressed genes (marker genes) is essential for understanding the roles and interactions of cell populations within tissues. To achieve this, traditional differential analysis approaches are often applied to individual cell-type bulk RNA-seq and single-cell RNA-seq data. However, real-world datasets often pose challenges, such as heterogeneous bulk RNA-seq and incomplete scRNA-seq. Heterogeneous bulk RNA-seq amalgamates gene expression profiles from multiple cell types and results in low resolution, while incomplete scRNA-seq does not capture some cell types from the tissue, leading to unknown cell types. Traditional methods fail to identify marker genes for such unknown cell types. MiCBuS addresses this limitation by generating Dirichlet-pseudo-bulk RNA-seq based on bulk and incomplete single-cell RNA-seq data. By performing differential analysis of gene expression on bulk and Dirichlet-pseudo-bulk RNA-seq samples, MiCBuS can identify the marker genes of unknown cell types, enabling the identification and characterization of these elusive cellular components. Simulation studies and real data analyses demonstrate that MiCBuS reliably and robustly identifies marker genes specific to unknown cell types, a capability that traditional differential analysis methods cannot achieve. Availability and implementation: MiCBuS is implemented in the R language and freely available at https://github.com/Shanshan-Zhang/MiCBuS.
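The abstract does not spell out how the Dirichlet-pseudo-bulk samples are built, so the following is only a minimal Python sketch of one plausible reading: mixing proportions over the known cell types are drawn from a Dirichlet distribution and used to combine per-cell-type mean expression profiles. The function name `dirichlet_pseudobulk`, the symmetric concentration `alpha`, and the use of mean profiles are all assumptions for illustration and are not taken from MiCBuS, which is implemented in R.

```python
import numpy as np

def dirichlet_pseudobulk(profiles, n_samples, alpha=1.0, seed=0):
    """Mix known cell-type profiles with Dirichlet-distributed proportions.

    profiles: array of shape (n_cell_types, n_genes), assumed here to hold the
              mean expression of each known cell type from scRNA-seq.
    Returns (proportions, pseudobulk) with shapes
    (n_samples, n_cell_types) and (n_samples, n_genes).
    """
    profiles = np.asarray(profiles, dtype=float)
    rng = np.random.default_rng(seed)
    # One proportion vector per pseudo-bulk sample; each row sums to 1.
    concentration = np.full(profiles.shape[0], alpha)
    proportions = rng.dirichlet(concentration, size=n_samples)
    # Each pseudo-bulk sample is a convex combination of the known profiles.
    pseudobulk = proportions @ profiles
    return proportions, pseudobulk
```

Under this reading, genes whose expression in real bulk samples exceeds what any mixture of known cell types can produce would be candidates for markers of the uncaptured cell types; the differential analysis step itself is not sketched here.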

3

TopOmics: Topic Modelling for All Omics

Sanguinetti, G.; El Kazwini, N.; Caretti, F.

2026-05-29 bioinformatics 10.64898/2026.05.26.727810 medRxiv

Top 0.1%

18.4%

Topic models have emerged as a popular paradigm to analyse and interpret complex single-cell and spatial data. Yet, current implementations are usually data-type specific and rely on different modelling and estimation approaches, hindering usability and interoperability. In this work we introduce TopOmics, a library to perform efficient and flexible topic modelling with any combination of -omics data at scale. The framework leverages standard libraries of the Python ecosystem, guaranteeing seamless integration with existing pipelines, and shows competitive performance against state-of-the-art methods while preserving interpretability. We provide several examples of TopOmics on diverse data sets, including a novel topic model for spatial multi-omic data, and an analysis of a very large VisiumHD data set.

4

Monju: Multi-criteria clustering in single-cell omics

Kaneko, T.; Sakaguchi, S.; Fujioka, S.; Yada, Y.; Kojima, R.; Naoki, H.

2026-06-01 bioinformatics 10.64898/2026.05.28.728427 medRxiv

Top 0.1%

17.9%

Clustering is a fundamental step in single-cell omics analysis. Although single-cell omics data can, in principle, be partitioned according to multiple biologically meaningful criteria, existing methods typically cluster cells using a single criterion. To address this problem, we developed Monju, a multi-criteria clustering method based on a deep generative mixture model. Monju divides cells into biologically reasonable submodels, each of which is equipped with an interpretable latent space. Furthermore, although the partitioning of cells into submodels varies across random seeds, each solution remains biologically plausible, collectively yielding multi-criteria clustering. Moreover, by integrating these multiple clustering solutions to perform meta-clustering, Monju enables the assessment of cluster stability. We applied Monju to human peripheral blood CITE-seq data and demonstrated that it can achieve multi-criteria clustering. Monju therefore provides a powerful and practical framework for dissecting cellular heterogeneity from multiple biological perspectives.

5

X-Plat: A polynomial regression based tool for cross-platform transformation of expression and methylation data

Krishnan, N. M.; Rahman, S. I.; Olsen, L. R.; Panda, B.

2026-03-30 genomics 10.64898/2026.02.22.707273 medRxiv

Top 0.1%

17.2%

Many biological studies could benefit from combining data from legacy microarray and high-throughput sequencing platforms, especially in clinical domains where collecting additional samples is not possible. However, incompatibility between platforms makes legacy data difficult to integrate, owing to differences in platform design, target preparation, and dependence on prior annotations. Here, we describe X-Plat, a cross-platform data transformation tool for both expression and methylation assays that interconverts data between microarray and sequencing platforms using per-gene second-degree polynomial regression. X-Plat learns cross-platform conversion rules from paired microarray-sequencing datasets spanning multiple conditions, sample sources, organisms, and platforms, and evaluates performance using cross-validated root mean square error (RMSE) per gene. In rat, Arabidopsis, and human datasets, X-Plat achieved lower cross-validated RMSE than TDM, HARMONY, and HARMONY2 for the vast majority of genes (≥95% in all sequencing-to-array transformations and most array-to-sequencing transformations, with nearly 82% in the Arabidopsis array-to-sequencing setting), and these findings were confirmed using RMSE on held-out test samples from the first cross-validation fold. X-Plat also achieved low RMSE (≤0.2) for the majority of CpG regions in paired human array and sequencing methylation datasets. Using X-Plat, users can transform data between microarray and high-throughput sequencing platforms, enabling cross-platform comparison and reuse of legacy cohorts.
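As an illustration of per-gene second-degree polynomial regression with cross-validated RMSE, here is a minimal Python sketch for a single gene measured on paired samples. It is not X-Plat's code: the function name `cv_rmse`, the unshuffled contiguous folds, and the pooling of squared errors across folds are assumptions made for the example.

```python
import numpy as np

def cv_rmse(x, y, n_folds=5, degree=2):
    """Cross-validated RMSE of a polynomial map from platform x to platform y.

    x, y: paired measurements of ONE gene across samples
          (e.g. x = microarray intensity, y = sequencing-based expression).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = np.arange(len(x))
    squared_errors = []
    # Contiguous, unshuffled folds (an assumption; real pipelines often shuffle).
    for held_out in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, held_out)
        # Least-squares fit of y = a*x^2 + b*x + c on the training samples.
        coefs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coefs, x[held_out])
        squared_errors.extend((pred - y[held_out]) ** 2)
    # Pool squared errors over all held-out samples before taking the root.
    return float(np.sqrt(np.mean(squared_errors)))
```

Applying `cv_rmse` independently to every gene (or CpG region) gives one error value per feature, which is the kind of per-gene quantity the abstract compares across methods.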

6

LongAllele: a joint inference framework for allele-specific analysis on long-read bulk and single-cell RNA sequencing

Xu, Z.; Wang, K.

2026-05-08 bioinformatics 10.64898/2026.05.05.722992 medRxiv

Top 0.1%

17.1%

Allele-specific analysis from RNA-seq is a powerful approach to characterize cis-regulatory effects. However, existing methods remain limited in both haplotype inference and allelic testing. Their haplotype-inference workflows separate variant calling, haplotype phasing, and read-haplotype assignment into sequential steps, failing to fully exploit within-read SNV linkage information and propagating errors into downstream allelic analysis. At the testing stage, they ignore non-phasable reads lacking heterozygous SNVs, biasing calls and inflating false positives, and remain incomplete across gene-, isoform-, and local-event-level variant effects. Here, we present LongAllele, a statistical framework that employs an expectation-maximization algorithm to jointly infer heterozygous variants, haplotype structure, and read-haplotype assignments from long-read bulk and single-cell RNA sequencing. LongAllele further introduces phasability-aware testing that explicitly accounts for non-phasable reads, avoiding inflated false-positive calls when haplotype information is incomplete. It also enables comprehensive allelic testing across gene-level ASE, isoform-level allele-specific transcript usage (ASTU), and local-event-level haplotype-associated exon and junction usage (HAEU and HAJU), providing a multi-scale view of cis-regulation. We applied LongAllele to long-read RNA-seq datasets spanning GTEx (multi-tissue bulk), peripheral blood mononuclear cells (single-cell), and human hippocampus (single-nucleus). LongAllele consistently revealed greater tissue and cell-type variability in expression-level than isoform-level allelic regulation, pinpointed high-impact regulatory variants including rare splice-site mutations missed by standalone variant callers, and showed that purifying selection constrains allelic imbalance at both gene and isoform levels. LongAllele offers a unified framework for haplotype-resolved cis-regulatory analysis across diverse cellular contexts.

7

Hidden Diversity in Yeast tRNAs: Comparative Genomics and Modification Mapping in a Eukaryotic Subphylum

Dineen, L.; Wilson, D.; LaBella, A. L.

2026-03-21 genomics 10.64898/2026.03.20.712421 medRxiv

Top 0.1%

16.8%

tRNAs are adapter molecules with an integral role in translation and further roles in stress adaptation. Processing of tRNA is tightly regulated and includes the enzymatic addition of several post-transcriptional modifications that are required for translation efficiency, recognition, selective translation, and structure. We currently lack a multi-species view of tRNA modifying enzymes across eukaryotes. Here, we performed a comparative analysis of tRNA gene sequence, modification enzymes, and modification profiles across the Saccharomycotina subphylum. We employed machine learning methods to explore tRNA sequence conservation and to annotate modifying enzymes known to exist in fungi, humans, and prokaryotes. We then applied Nano-tRNAseq to three species (Saccharomyces cerevisiae, Hanseniaspora uvarum, and Yarrowia lipolytica) to profile modification signatures and compare modification patterns. We identified substantial lineage-specific conservation of tRNA sequences despite the highly conserved tRNA structure. We found significant variation in tRNA modifying enzyme repertoires across Saccharomycotina, including lineage-specific losses, and annotated a prokaryotic-associated enzyme, tilS. Integrating genomic and sequencing data enabled us to link enzyme repertoires with tRNA gene sequences. tRNA sequencing revealed distinct modification signatures across the three focal species, and further analysis using generalized linear modelling suggested that tRNA enzyme loss is associated with absence of the target tRNA nucleotide in gene sequences. This work provides the first integrated view of tRNA gene and modification diversity in eukaryotes and expands the field of tRNA diversity in fungi.

8

Massively parallel characterization and deep learning of enhancers in plant genomes

Jores, T.; Mueth, N. A.; Gorjifard, S.; Triesch, S.; Schirmer, D.; Tonnies, J.; Bubb, K. L.; Cuperus, J. T.; Fields, S.; Queitsch, C.

2026-04-29 plant biology 10.64898/2026.04.26.720828 medRxiv

Top 0.1%

15.0%

Enhancers coordinate gene expression in response to developmental and environmental cues. Because plant enhancers lack the readily detectable molecular hallmarks of animal enhancers, their systematic functional characterization has yet to be accomplished. Here, we characterize the species- and condition-specific enhancer activity of over 350,000 sequences derived from accessible chromatin regions of the Arabidopsis, tomato, maize, and sorghum genomes. Enabled by the massive scale of the data, we developed plantGREP, a deep learning model that predicts enhancer strength and identifies the underlying functional sequence motifs. We apply plantGREP to evolve strong constitutive as well as species- and condition-specific enhancers, and to locate regions with enhancer activity upstream of developmental genes in crop genomes. These results should facilitate the targeted editing of enhancers in crop genomes and the design of cell-type-specific plant enhancers.

9

Histone Modification Metapeaks are Epigenetic Landmarks Predictive of Cell State

Tanner, R. M.; Perkins, T. J.

2026-04-02 genomics 10.64898/2026.03.31.715657 medRxiv

Top 0.1%

14.8%

Histone modifications are a key component of the epigenetic state of a cell, and they vary widely across different cell and tissue types, conditions, and disease states. Indeed, the majority of the genome is enriched with one histone mark or another across the thousands of cellular conditions that have been studied to date. Here, we use the largest-to-date collection of histone modification ChIP-seq datasets to identify the most important sites of histone modifications genome-wide. Collected and uniformly reprocessed by the International Human Epigenome Consortium, this data includes 5339 datasets enriched at nearly one billion total peaks across 59 different major cell or tissue types and in healthy and disease conditions, for six different histone marks. We propose FindMetapeaks, a new approach to identifying histone mark metapeaks, which are genomic regions with enrichment of a mark across many samples. We show that many of these epigenetic metapeaks are strongly indicative of cell and tissue type, or are associated with other sample characteristics, and highlight key regulatory regions of the genome. However, we also show that many metapeaks contain redundant information, and that parsimonious subsets of metapeaks can be selected by machine learning to predict cell state. Our histone mark metapeak atlas provides a concise set of regions for interpreting the epigenome. Availability: https://github.com/rmbioinfo83/FindMetapeaks/

10

Ancestral Genome Reconstruction.

Siguret, C.; Olivier, M.; Huneau, C.; SOW, M. D.; Stenger, P.-L.; Klopp, C.; Martin, M.-L.; Tamby, J.-P.; Civan, P.; Pont, C.; Mathieu, O.; SALSE, J.

2026-04-16 genomics 10.64898/2026.04.16.718917 medRxiv

Top 0.1%

14.6%

AGR (Ancestral Genome Reconstruction) is an automated, publicly available, open-source pipeline that infers paleogenomes from comparisons of modern species genomes. It exploits hierarchical clustering of inter-species chromosomal synteny relationships and can be used to reveal how ancestral genomes, genes, sequences, and functions have been shaped over millions of years of plant evolution.

11

pyTrance finds co-localizing RNAs in subcellular spatial transcriptomics data

Strenger, L.; Cerda-Jara, C. A.; Karaiskos, N.; Rajewsky, N.

2026-05-11 bioinformatics 10.64898/2026.05.07.723470 medRxiv

Top 0.1%

14.6%

Regulation of RNA subcellular localization is crucial for cellular functions in health and disease. For example, local translation of co-localized RNAs is crucial for neural biology. However, it is challenging to identify RNA co-localization events. Here, we present pyTrance, a computational framework that predicts and quantifies subcellular RNA co-localization from spatial transcriptomics data, leveraging latent embeddings learned by a graph neural network. In extensive benchmarking, pyTrance detected co-localizing RNAs more accurately and robustly than existing methods. In mouse brain tissue, pyTrance found several RNA co-localization patterns. Co-localized RNAs were often functionally related and consistent with existing biological knowledge. Interestingly, among novel patterns, pyTrance identified co-localization of GABAergic markers, including Gad1, in neuronal projections. Experimental validation led to the discovery of a spatial overlap between Gad1 mRNA and protein, strongly suggesting local translation. Our results establish pyTrance as a state-of-the-art method to discover biologically important RNA co-localization at subcellular resolution.

12

Informational blueprints reveal condition-dependent gene regulatory architectures

Gokmen, D. E.; Pan, R. W.; Roeschinger, T.; Quake, S.; Garcia, H.; Phillips, R.; Vitelli, V.

2026-05-20 genomics 10.64898/2026.05.18.726006 medRxiv

Top 0.1%

14.5%

While coding regions in the genome have a direct interpretation in terms of protein products, significant fractions are non-coding and yet control essential biological functions. Unlike the genetic code, there is no "lookup table" that identifies where regulatory proteins, known as transcription factors (TFs), bind. Here, we extract these binding sites by distilling sequences of nucleotide letters into collective coordinates (hyperletters) representing the binding sites that are active under specific environmental conditions. Going beyond local information footprints between individual bases and expression levels, our information blueprint algorithm compresses the global information by optimising filters that simultaneously scan an entire promoter sequence. Inspired by renormalisation-group techniques, we identify TF binding sites as coarse-grained variables combining groups of correlated mutations with the highest collective impact on gene expression. We validate our approach on experimental data for E. coli and discover novel regulatory elements illustrating its deployment at scale across growth conditions.

13

Benchmarking ambient RNA removal across droplet and well-plate platforms reveals artificial count generation as a critical failure mode of scAR and CellClear

Schroeder, L.; Gerber, S.; Ruffini, N.

2026-04-10 bioinformatics 10.64898/2026.04.08.717130 medRxiv

Top 0.1%

14.5%

Background: Ambient RNA contamination is a pervasive artifact of single-cell and single-nucleus RNA sequencing (sxRNA-seq), yet no consensus exists on which computational removal tool performs best across experimental platforms. Results: We present a systematic benchmark of six tools (CellBender, DecontX, SoupX, scCDC, scAR, and CellClear), evaluated across six human-mouse cell line mixing (hgmm) datasets (1k-20k cells) providing partial ground truth, two droplet-based complex tissue datasets (PBMC scRNA-seq; prefrontal cortex snRNA-seq), and a well-plate-based dataset (BD Rhapsody WBC). Using inter-species counts as partial ground truth, we quantify sensitivity, specificity, precision, and removal consistency per tool. We further apply a count-integrity criterion quantifying gene-cell positions where corrected values exceed raw counts. This reveals that scAR and CellClear do not merely denoise but fundamentally restructure count matrices: CellClear replaces >93% of counts with values derived from matrix factorization, while scAR generates spurious cell types absent from uncorrected data, including three spurious coarse cell types in the BD Rhapsody dataset and up to eight novel cell types in the prefrontal cortex. CellBender and SoupX exhibit reliable contamination removal with minimal count distortion. DecontX and scCDC are the only tools operable on non-droplet platforms without raw count matrix access. Runtime benchmarking at atlas scale (up to 172,000 nuclei) further demonstrates that CellClear fails to scale. Conclusions: Count matrix integrity, not removal sensitivity alone, must be a primary criterion when selecting ambient RNA correction tools. We provide platform-specific recommendations and a decision framework to guide tool selection across experimental contexts.
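The count-integrity criterion described above (gene-cell positions where the corrected value exceeds the raw count) can be sketched in a few lines of Python. This is an illustrative reading of the sentence in the abstract, not the authors' benchmark code; the function name `count_integrity_violations` and the choice to report both a count and a fraction are assumptions.

```python
import numpy as np

def count_integrity_violations(raw, corrected):
    """Count gene-cell positions where a 'corrected' value exceeds the raw count.

    raw, corrected: dense arrays of identical shape (genes x cells or
    cells x genes; the orientation does not matter for this check).
    Returns (number of violating positions, fraction of all positions).
    """
    raw = np.asarray(raw)
    corrected = np.asarray(corrected)
    if raw.shape != corrected.shape:
        raise ValueError("raw and corrected matrices must have the same shape")
    # Ambient removal should only subtract counts, so corrected > raw means
    # the tool has generated counts that were never observed.
    inflated = corrected > raw
    return int(inflated.sum()), float(inflated.mean())
```

A tool that only removes contamination would give zero violations; any positive value indicates artificially generated counts in the sense of the title.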

14

vcfilt: A Zero-Allocation Streaming Filter for High-Throughput VCF Processing

KP, M. M.

2026-04-16 bioinformatics 10.64898/2026.04.14.718370 medRxiv

Top 0.1%

14.5%

Variant Call Format (VCF) files are the dominant interchange format for genomic variant data, but their size - routinely exceeding tens of gigabytes for population-scale studies - creates a significant computational bottleneck at the quality-filtering stage. Existing tools such as bcftools and vcftools provide broad functionality through general-purpose expression engines, but incur substantial per-record overhead from dynamic field lookup, type resolution, and heap allocation. We present vcfilt, a streaming, batch-parallel VCF filter implemented in Go that restricts its scope to three high-frequency filter criteria (INFO/DP, INFO/AF, and QUAL) and applies them via a zero-allocation byte-scan parser. Benchmarked on real 1000 Genomes Project data (chromosome 20, 1,811,146 variants), vcfilt achieves 147,000 variants/second on an 18 GB plain-text VCF file using a single thread - a 12.2x speedup over bcftools 1.18 under identical conditions. On gzip-compressed input, the speedup is 7.9x. Output is byte-for-byte identical to bcftools across all tested filter combinations. vcfilt is distributed as a self-contained static binary, a Docker image, and a Singularity-compatible container. The source code and all benchmark scripts are openly available under the MIT licence.
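To make the three filter criteria concrete, the following Python sketch streams VCF lines and keeps records that meet thresholds on QUAL, INFO/DP, and INFO/AF. It illustrates the filtering logic only and is not vcfilt, which is written in Go and uses a zero-allocation byte-scan parser. The threshold values, the "greater than or equal" direction of each comparison, the use of the maximum AF for multi-allelic records, and the rejection of records with missing values are all assumptions.

```python
from typing import Iterable, Iterator, Optional

def info_value(info: str, key: str) -> Optional[str]:
    """Return the value of KEY=value in a VCF INFO column, or None if absent."""
    for field in info.split(";"):
        name, sep, value = field.partition("=")
        if name == key and sep:
            return value
    return None

def passes(line: str, min_qual: float, min_dp: int, min_af: float) -> bool:
    """Check one tab-separated VCF data line against the three thresholds."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:  # a VCF data line has at least 8 fixed columns
        return False
    qual, info = cols[5], cols[7]
    dp = info_value(info, "DP")
    af = info_value(info, "AF")
    if qual == "." or dp is None or af is None:
        return False  # assumption: missing values fail the filter
    try:
        # Assumption: for multi-allelic records, use the largest AF.
        best_af = max(float(a) for a in af.split(","))
        return float(qual) >= min_qual and int(dp) >= min_dp and best_af >= min_af
    except ValueError:
        return False

def filter_vcf(lines: Iterable[str], min_qual: float = 30.0,
               min_dp: int = 10, min_af: float = 0.01) -> Iterator[str]:
    """Yield header lines unchanged and data lines that pass all thresholds."""
    for line in lines:
        if line.startswith("#") or passes(line, min_qual, min_dp, min_af):
            yield line
```

Because `filter_vcf` is a generator over an iterable of lines, it can be fed an open file handle and never holds more than one record in memory, which mirrors the streaming design described in the abstract.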

15

Systematic identification of seed-driven off-target effects in Perturb-seq experiments

Hartman, A.; Blair, J. D.; Nguyen, T. P.; Dyson, K.; Bradu, A.; Takacsi-Nagy, O.; Santostefano, K.; Boade, T.; Bolanos, M.; Zhu, R.; Dann, E.; Marson, A.; Gitler, A.; Satija, R.; Satpathy, A. T.; Roth, T. L.

2026-03-28 genomics 10.64898/2026.03.27.714658 medRxiv

Top 0.1%

14.5%

Genome-wide Perturb-seq (GWPS) has emerged as a powerful approach for unbiased mapping of gene regulatory networks. A key assumption underlying many Perturb-seq analyses is that each guide RNA exclusively perturbs a single target locus. Without methods to identify and filter off-target events, erroneous gene-pathway associations driven by off-target activity can propagate into downstream analyses. Here, we present a workflow for the systematic identification of candidate off-target events in CRISPRi Perturb-seq experiments. Our approach exploits the observation that cells harboring a guide which represses an off-target gene display transcriptional similarity to cells in which that gene is directly targeted by an on-target guide. We apply our workflow to multiple GWPS datasets and nominate off-target events in which a guide nominally targeting one gene also represses a distinct gene producing a phenotype likely attributable to the off-target perturbation. We use both off-target gene repression and guide seed sequence alignments at the off-target promoter locus as evidence for off-target effects and find independent evidence of putative off-target events in separate GWPS datasets. Together, these results establish a principled framework for the identification and filtering of off-target guide effects in Perturb-seq experiments.

16

noHiC: A Pipeline For Plant Contig Scaffolding Using Personalized References From Pangenome Graphs

Nguyen-Hoang, A.; Arslan, K.; Kopalli, V.; Windpassinger, S.; Perovic, D.; Stahl, A.; Golicz, A.

2026-03-19 bioinformatics 10.64898/2026.03.17.712436 medRxiv

Top 0.1%

14.4%

Hi-C data is commonly used for reference-free de novo scaffolding. However, with the rapid increase in high-quality reference genomes, reference-guided workflows are now more practical for assembling large numbers of target genomes without relying on costly and labor-intensive Hi-C sequencing. Recently, a pangenome graph-based haplotype sampling algorithm was introduced to generate personalized graphs for target genomes. Such graphs have strong potential as references for reference-guided contig scaffolding. Here, we present noHiC, a reference-guided scaffolding pipeline supporting key steps of plant contig scaffolding. A distinctive feature of noHiC is the nohic-refpick script, generating a best-fit synthetic reference (synref) from a pangenome graph that is genetically close to the target contigs. This enables the integration of genetic information from many references (up to 48 in our tests) without using them separately during scaffolding. Synrefs showed advantages over highly contiguous conventional references in reducing false contig breaking during reference-based correction. Additionally, nohic-refpick can be combined with fast scaffolders (ntJoin) to rapidly produce highly contiguous assemblies using synrefs derived from pangenome graphs. The noHiC pipeline, used alone or in combination with ntJoin, can generally produce assemblies that are structurally consistent with public Hi-C-based or manually curated genomes. The pipeline is publicly available at https://github.com/andyngh/noHiC.

17

DNAharvester: A Nextflow Pipeline for Analysing Highly Degraded DNA from Ancient and Historical Specimens

Sharif, B.; Kutschera, V. E.; Oskolkov, N.; Guinet, B.; Lord, E.; Chacon-Duque, J. C.; Oppenheimer, J.; van der Valk, T.; Diez-del-Molino, D.; D. Heintzman, P.; Dalen, L.

2026-04-21 bioinformatics 10.64898/2026.04.20.719564 medRxiv

Top 0.1%

14.4%

Ancient DNA (aDNA) research has advanced rapidly with the development of high-throughput sequencing, which now enables genome-wide analyses of large collections of prehistoric specimens. However, analysing palaeontological and archaeological material with highly degraded DNA constitutes a major bioinformatic challenge. DNA from such samples is characterised by short fragment lengths, low endogenous content, post-mortem damage, and considerable cross-species contamination, which can increase spurious mapping and reference bias, affecting downstream population genetic inferences. Here we present DNAharvester, a modular and reproducible pipeline designed specifically for the processing of highly degraded DNA from ancient and historical specimens. DNAharvester integrates metagenomic filtering before mapping, competitive mapping, adaptive aligner selection (incorporating algorithms such as BWA-aln, BWA-mem, and Bowtie2), and systematic evaluation of reference bias and spurious mapping. By incorporating flexible mapping and filtering strategies, the pipeline can be adapted to varying sample preservation, with a distinct focus on maximising authentic data recovery from highly degraded material. Furthermore, DNAharvester features comprehensive subworkflows for iterative assembly of mitogenomes, identification of genomic repeats and CpG sites, taxonomic classification, microbial/pathogen screening of unmapped reads, genetic sex determination, and variant calling for downstream analyses. To accommodate datasets with varying sequencing depths, the pipeline incorporates multiple variant calling strategies, including diploid variant calling, genotype likelihood estimation, and pseudo-haploid random allele calling. Implemented in Nextflow, DNAharvester provides a highly scalable, containerised framework that enhances reproducibility, portability, and robustness in aDNA analyses. We validated the pipeline across a gradient of simulated scenarios and empirical datasets, demonstrating its ability to systematically mitigate complex background contamination while preserving authentic genomic signals even in the most challenging of circumstances. By streamlining complex bioinformatic tasks through simple configuration files, DNAharvester establishes a standardised approach for the rigorous analysis of highly degraded DNA datasets and makes genomic analyses of ancient remains accessible to the broader research community.

18

Building computational benchmarks: an Omnibenchmark reimplementation of a single-cell preprocessing pipeline evaluation

Choudhury, A.; Kitak, T.; Carrillo, B.; Busch, P.; Emons, M.; Gunz, S.; Koderman, M.; Luo, S.; Mallona, I.; Meara, A.; Wissel, D.; Robinson, M. D.

2026-05-05 bioinformatics 10.64898/2026.05.01.722166 medRxiv

Top 0.1%

14.3%

In the past few years, we have seen a veritable surge in single-cell (e.g., RNA sequencing) techniques and datasets, enabling increasingly detailed characterization of cellular heterogeneity across tissues and conditions. This surge has been complemented by a large number of analysis frameworks and pipelines, which bring with them a large parameter space and many researcher degrees of freedom in how they are used. Many neutral benchmarks have been presented for various computational tasks, but most make design decisions that render them incompatible with each other, e.g., different datasets, metrics, or parameter sets. In this work, we showcase a recently developed framework, Omnibenchmark, to build reproducible, extensible and standardized method comparisons. This not only facilitates the broad investigation of pipelines used in single-cell data analysis, but also highlights how the process of building benchmarks can be streamlined and unified. We do this as an initial proof-of-principle for an arm's-length benchmark that evaluates five single-cell RNA sequencing pipelines (filtering to normalization to dimensionality reduction to clustering) on three datasets. This standardization enables benchmarks to be easily extended in several directions, including broader parameter sweeps, comparisons across software versions and architectures, isolation of pipeline steps, and integration of additional pipelines, datasets, and metrics.

19

Tandem: a bioinformatics tool for detection, mechanism classification, and population quantification of bacterial tandem gene duplications

Ngan, W. Y.; Smith, E. S. J.

2026-05-26 bioinformatics 10.64898/2026.05.22.727201 medRxiv

Top 0.1%

14.3%

Motivation: Tandem gene duplication drives antibiotic resistance, metabolic adaptation, and gene-family expansion in bacteria, but no tool detects duplications in reference genomes, discovers their junctions in isolate sequencing, and quantifies the junctions in population samples. Existing callers (e.g. breseq) detect duplications without classifying formation mechanisms and often fail to quantify the duplication. Results: Tandem has three modules. Module 1 detects reference-genome duplications by NUCmer self-alignment and classifies each by homologous-recombination signature and junction microhomology length. Module 2 confirms junctions in whole-genome sequencing at user-nominated coordinates after the user inspects the coverage plot. Module 3 quantifies known junctions in population sequencing using the novel Junction Read Ratio (JRR). On 280 artificial population tests across seven bacterial species, Tandem achieves 100% recall and 4.3% mean absolute error. Applied to experimentally evolved Pseudomonas fluorescens SBW25 populations, Tandem resolves multiple co-segregating duplication fragments. Availability: Source code, documentation, and test data are available under the MIT License at https://github.com/yuingan/tandem. Implemented in Python 3. Requires NUCmer (MUMmer4), minimap2, and samtools.

20

Using Mapping-Profiles to Refine Strain-Level Metagenomic Classification

Lipovac, J.; Angevin, L.; Krizanovic, K.

2026-05-20 bioinformatics 10.64898/2026.05.18.725856 medRxiv

Top 0.1%

14.3%

Metagenomic classification at the strain level remains challenging due to high sequence similarity among closely related genomes, which leads to ambiguous read mappings and frequent false-positive strain detections. Reducing such errors improves the reliability of strain-level analyses, which is critical for applications such as pathogen detection. We introduce StrainRefine, a post-mapping refinement method that analyzes read-reference mapping profiles to resolve ambiguous assignments among highly similar genomes. The method represents candidate reference genomes using binary profiles that capture read-support patterns and measures similarity between references based on profile overlap. The method clusters references based on similar mapping profiles, filters weakly supported genomes, and reassigns reads to representative references, reducing redundant reporting of near-identical strains. StrainRefine substantially reduces false-positive strain detections while preserving recall and improving agreement between predicted and true abundance profiles. On large-scale metagenomic datasets, it achieves a substantially improved precision-recall balance compared to existing mapping-based approaches, with the standalone method obtaining the highest read-level classification accuracy on the most complex evaluated dataset. Unlike many strain-level tools designed for individual species, StrainRefine operates without prior assumptions about sample composition or curated species-specific reference collections, while still achieving comparable performance in single-species settings on species-specific reference databases. These results highlight mapping-profile similarity as an effective signal for improving strain-level metagenomic classification.
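The profile-based refinement described above can be illustrated with a small Python sketch: each reference genome is represented by the set of reads that map to it (a binary profile), weakly supported references are dropped, and the remaining references are grouped when their profiles overlap strongly. This is not the StrainRefine implementation. The use of Jaccard similarity as the overlap measure, the greedy grouping order, the thresholds `min_reads` and `min_similarity`, and the choice of the best-supported reference as representative are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of read identifiers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_references(profiles, min_reads=2, min_similarity=0.75):
    """Group references with similar read-support profiles.

    profiles: dict mapping reference name -> set of read ids mapped to it.
    Returns a dict mapping each representative reference -> list of members
    (the representative itself is the first member).
    """
    # Filter weakly supported genomes.
    kept = {ref: reads for ref, reads in profiles.items() if len(reads) >= min_reads}
    # Visit references from most to least supported so that each cluster's
    # representative is its best-supported member (ties broken by name).
    order = sorted(kept, key=lambda ref: (-len(kept[ref]), ref))
    clusters = {}
    for ref in order:
        for rep in clusters:
            if jaccard(kept[ref], kept[rep]) >= min_similarity:
                clusters[rep].append(ref)
                break
        else:
            clusters[ref] = [ref]
    return clusters
```

Reads assigned to any non-representative member of a cluster would then be reassigned to that cluster's representative, so near-identical strains are reported once instead of several times.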